Estimating latent feature-feature interactions in large feature-rich graphs
Abstract
Complex networks arising in nature are usually modeled as (directed or undirected) graphs describing some connection between the objects that are identified with their nodes. In many real-world scenarios, though, those objects are endowed with properties and attributes (hereafter called features). In this paper, we confine our interest to binary features, so that every node has a precise set of features; we assume that the presence or absence of a link between two given nodes depends on the features that the two nodes exhibit. Although this situation is ubiquitous, there is a limited body of research dealing with large graphs of this kind. Many previous works considered homophily as the only possible mechanism translating node features into links: two nodes are linked with a probability that depends on the number of features they share. Other authors instead developed more sophisticated models (often using Bayesian Networks [29] or Markov Chain Monte Carlo [20]) that are able to handle complex feature interactions but do not scale to very large networks. We study a model derived from the work of Miller et al. [46], where interactions between pairs of features can foster or discourage link formation. In this work, we investigate how to estimate the latent feature-feature interactions in this model. We propose two solutions: the first assumes feature independence and is essentially a Naive Bayes approach; the second uses a learning algorithm that relaxes the independence assumption and is based on perceptron-like techniques. In fact, we show that the model equation can be recast as the prediction rule of a perceptron. We analyze how classical results for perceptrons can be interpreted in this context; then, we define a fast and simple perceptron-like algorithm for this task.
This approach (which we call Llama, Learning LAtent feature-feature MAtrix) can process hundreds of millions of links in minutes. Our experiments show that it can be applied even to very large networks. We then compare the two techniques in two different ways. First, we produce synthetic datasets by generating random graphs that follow the model we adopted. These experiments show how well the Llama algorithm can reconstruct the latent variables of the model, and they provide evidence that the naive independence assumptions made by the first approach are detrimental in practice. Then we consider a real, large-scale citation network where each node (i.e., paper) can be described by different types of characteristics. This second set of experiments confirms that our algorithm can find meaningful latent feature-feature interactions. Furthermore, our framework can be used to assess how well each set of features can explain the links in the graph.
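The abstract does not state the model equation, so the following is only a minimal sketch of the general idea, not the authors' Llama algorithm. It assumes a deterministic rule in which a link from node i to node j exists when the sum of the interaction weights over all feature pairs (one feature from i, one from j) is positive. Under that assumption, each candidate link is a perceptron input whose active coordinates are those feature pairs. The function names, the threshold at zero, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def score(W, feats_i, feats_j):
    """Sum the interaction weights W[(h, k)] over every feature h of the
    source node and every feature k of the target node."""
    return sum(W.get((h, k), 0.0) for h, k in product(feats_i, feats_j))


def generate_edges(features, W):
    """Build a synthetic graph under the ASSUMED rule: i -> j iff score > 0."""
    nodes = list(features)
    return {
        (i, j)
        for i, j in product(nodes, repeat=2)
        if i != j and score(W, features[i], features[j]) > 0
    }


def fit_perceptron(features, examples, epochs=10):
    """Classic mistake-driven perceptron over feature pairs.

    features: dict mapping node -> set of features.
    examples: list of ((i, j), y) with y = +1 (link) or -1 (no link).
    """
    W = defaultdict(float)
    for _ in range(epochs):
        mistakes = 0
        for (i, j), y in examples:
            if y * score(W, features[i], features[j]) <= 0:
                # Perceptron update: shift every involved pair weight by y.
                for h, k in product(features[i], features[j]):
                    W[(h, k)] += y
                mistakes += 1
        if mistakes == 0:  # every example is classified correctly
            break
    return W


# Hypothetical toy data: four nodes, two features, latent rule "a links to b".
features = {0: {"a"}, 1: {"b"}, 2: {"a"}, 3: {"b"}}
latent = {("a", "b"): 1.0, ("b", "a"): -1.0}
edges = generate_edges(features, latent)
examples = [((i, j), 1 if (i, j) in edges else -1)
            for i, j in product(features, repeat=2) if i != j]
learned = fit_perceptron(features, examples)
```

On this toy input the learned weights reproduce the generated edge set, although their magnitudes need not match the latent matrix; only the decisions they induce are constrained by the data.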
Journal: CoRR
Volume: abs/1612.00984
Publication date: 2016